Week 4: Clustering and classification

This week’s topic was classifying and clustering statistical data: fitting a linear discriminant analysis (LDA) model and running k-means clustering on the Boston housing data.

The data

I load the Boston data set into R from the MASS library. Boston ships with the MASS package, so no separate download is needed.

library(dplyr);library(tidyr);library(ggplot2)
library(boot);library(MASS);library(corrplot)
library(plotly)
#library(tidyverse)

#Load the Boston data from the MASS package. 
data("Boston")

The Boston data set contains information about factors affecting housing values in suburban areas of Boston. The data includes e.g. the per capita crime rate (crim), the full-value property-tax rate per $10,000 (tax) and the pupil-teacher ratio (ptratio).

#Explore the structure and the dimensions of the data
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

From the structure and the summary we can see that all 14 variables are numeric (chas and rad are stored as integers) and that their scales differ widely: for example, crim ranges from about 0.006 to 89 while nox stays below 1.
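This can also be confirmed programmatically with a quick check (not part of the original analysis):

```r
# every column of Boston is numeric; is.numeric() is TRUE for both
# the integer columns (chas, rad) and the double columns
library(MASS)
data("Boston")
all(sapply(Boston, is.numeric))  # TRUE
```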

# The correlation matrix 
cor_matrix<-cor(Boston) %>% round(digits = 2)

# Plotting the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b",tl.pos = "d",tl.cex = 0.6)

Above is the correlation plot of the variables. It shows, for example, that rad (index of accessibility to radial highways) correlates negatively with dis (weighted mean of distances to five Boston employment centres), and that lstat (percentage of lower-status population) correlates negatively with medv (median value of owner-occupied homes in $1000s).
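These pairwise correlations can also be read off numerically; note that both are in fact negative:

```r
library(MASS)
data("Boston")
cor(Boston$rad, Boston$dis)    # moderate negative correlation (~ -0.49)
cor(Boston$lstat, Boston$medv) # strong negative correlation (~ -0.74)
```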

The data must be scaled before any further analysis, since distance- and discriminant-based methods are sensitive to the widely differing variable scales.

boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
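As a sanity check: scale() subtracts each column's mean and divides by its standard deviation, so every scaled column should end up with mean 0 (up to floating point) and standard deviation 1:

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# column means are ~0 and standard deviations are 1 after scaling
round(colMeans(boston_scaled), 10)
apply(boston_scaled, 2, sd)
```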

The scaled crime rate is categorized into four classes, using its quantiles as the break points:

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# save it
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
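Because the breaks are the quartiles of crim, each of the four categories should contain roughly a quarter of the 506 observations. A quick self-contained check (crim_scaled is a local name used only here):

```r
library(MASS)
data("Boston")
crim_scaled <- as.numeric(scale(Boston$crim))
bins <- quantile(crim_scaled)
crime <- cut(crim_scaled, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)  # roughly equal quartile counts
```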

Randomly split the data into a training set (80% of the rows) and a test set (the remaining 20%):

# choose randomly 80% of the rows
n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)

# create train and test set
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
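A small caveat: sample() draws different rows on every run, so the split (and all results below) changes between knits. A reproducible variant would fix the random seed first; a sketch, where the seed value 2017 is arbitrary:

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(2017)                     # fix the RNG so the split is reproducible
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
nrow(train); nrow(test)            # 404 and 102
```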

Next I fit a linear discriminant analysis, using the categorical crime rate as the target variable and all the other variables as predictors.

lda.fit <- lda(crime ~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2599010 0.2475248 0.2400990 
## 
## Group means:
##                   zn       indus         chas        nox         rm
## low       0.97941010 -0.98171566 -0.117932980 -0.8756611  0.5095695
## med_low  -0.08031574 -0.23590590 -0.009855719 -0.5304511 -0.1707712
## med_high -0.37404455  0.08896355  0.200122961  0.3222790  0.1256098
## high     -0.48724019  1.01721867 -0.109974419  1.0597364 -0.4384570
##                 age        dis        rad        tax    ptratio      black
## low      -0.8673216  0.8996590 -0.6947544 -0.7406985 -0.4386490  0.3826578
## med_low  -0.2637219  0.3359894 -0.5443600 -0.4568671 -0.0670028  0.3272386
## med_high  0.3414028 -0.3316590 -0.3766291 -0.3229980 -0.2639944  0.1173009
## high      0.8376067 -0.8700253  1.6371072  1.5133254  0.7795879 -0.8789726
##                 lstat       medv
## low      -0.814431632  0.5883712
## med_low  -0.067085718 -0.0159933
## med_high -0.003915838  0.2116094
## high      0.957036476 -0.7610892
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.122311233  0.76129508 -0.63606625
## indus    0.008265984 -0.42176411  0.58482782
## chas    -0.095586675 -0.08728622  0.01938195
## nox      0.472665518 -0.63403754 -1.62060573
## rm      -0.115951055 -0.03241037 -0.19573621
## age      0.280737808 -0.20704065 -0.05850912
## dis     -0.059128393 -0.27690382  0.07237108
## rad      3.036995267  0.75458201  0.08371175
## tax     -0.071074574  0.16243360  0.39284080
## ptratio  0.118185684  0.06767660 -0.36671221
## black   -0.148982845 -0.02962798  0.14013688
## lstat    0.209399512 -0.24494411  0.29511596
## medv     0.193628064 -0.40787376 -0.32877666
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9477 0.0371 0.0151

Plotting LDA results with a biplot

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

Evaluating the LDA prediction performance on the test data:

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       14      11        0    0
##   med_low    6      14        1    0
##   med_high   1       6       19    0
##   high       0       0        0   30
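From the cross-tabulation, the overall accuracy is the sum of the diagonal divided by the total count. Using the numbers printed above:

```r
# confusion matrix copied from the output above (rows = correct class)
tab <- matrix(c(14, 11,  0,  0,
                 6, 14,  1,  0,
                 1,  6, 19,  0,
                 0,  0,  0, 30),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))
sum(diag(tab)) / sum(tab)  # (14 + 14 + 19 + 30) / 102 = 0.7549...
```

So this particular split classifies about 75% of the test observations correctly: the high category is predicted perfectly, while most of the confusion is between low and med_low.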

Loading the Boston data again and preparing it for clustering: the data is standardized and the Euclidean distances between observations are calculated.

data('Boston')
Boston <- scale(Boston)
dist_eu <- dist(Boston)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000
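dist() defaults to the Euclidean metric; a toy example makes the difference to, say, the Manhattan metric concrete:

```r
# two points: (0, 0) and (3, 4)
m <- rbind(c(0, 0), c(3, 4))
dist(m, method = "euclidean")  # sqrt(3^2 + 4^2) = 5
dist(m, method = "manhattan")  # |3| + |4| = 7
```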

Clustering the data with k-means, using 4 cluster centroids:

km <-kmeans(Boston, centers = 4)
pairs(Boston, col = km$cluster)

Finding the optimal number of clusters by calculating the total within-cluster sum of squares (TWCSS) for 1 to 10 clusters:

# calculate WCSS
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

From the plot above, the optimal number of cluster centroids seems to be 2: the best number of clusters is where the TWCSS drops most sharply, i.e. the elbow of the curve.

K-means clustering again but with 2 cluster centroids.

km <-kmeans(Boston, centers = 2)
pairs(Boston, col = km$cluster)
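The k-means object also reports cluster sizes and centroids, which helps interpret the two groups. A sketch (with a seed set, the result is reproducible; the seed value 123 is arbitrary):

```r
library(MASS)
data("Boston")
set.seed(123)
km <- kmeans(scale(Boston), centers = 2)

km$size                         # number of observations in each cluster
round(km$centers[, "crim"], 2)  # scaled mean crime rate in each cluster
```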

For the 3D plot, the training data (without the crime classes) is projected onto the linear discriminants by multiplying it with the LDA coefficient matrix:

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
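This matrix product is essentially the same projection that predict.lda() applies, up to centering the data first. As an illustration on a self-contained dataset (iris is used here only as a stand-in for the Boston split):

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)
X <- as.matrix(iris[, 1:4])

# predict.lda centers the data at the prior-weighted group means (here
# equal to the overall column means) and multiplies by fit$scaling
by_hand <- scale(X, center = colMeans(X), scale = FALSE) %*% fit$scaling
via_predict <- predict(fit)$x

max(abs(by_hand - via_predict))  # ~0 up to floating point
```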

A 3D scatter plot of the projected training data, colored by the crime categories:

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)